---
title: "Extraction Workers"
type: concept
created: 2026-04-18
updated: 2026-04-18
sources: ["raw/articles/06-reading-telemetry.md", "raw/notes/memory.md"]
tags: [curriculum-mapper, extraction, workers, detached-process]
---

# Extraction Workers

Extraction Workers are the background processes inside the [[Curriculum Mapper]] that parse curriculum documents, send them to GPT-4o for structured extraction, and persist the results as `curriculum_objective` records.

## Architecture: Detached Child Processes

Extraction workers are spawned as **detached child processes** — they continue running even if the CM server restarts or crashes. This is critical because extraction runs can take minutes to hours for large territories.

```mermaid
flowchart LR
    UI["CM Dashboard\nAdmin triggers extraction\nfor territory+subject+year"] --> CM["CM Server :3117"]
    CM -->|"spawnExtraction()"| W1["Worker Process\n(detached child)\nPID: 12345"]
    CM -->|"spawnExtraction()"| W2["Worker Process\n(detached child)\nPID: 12346"]
    CM -->|restart/crash| CM2["CM Server :3117\n(restarted)"]
    W1 -->|"continues running"| W1
    W2 -->|"continues running"| W2
    W1 -->|"writes status"| DB["curriculum_objectives\nDB"]
    W2 -->|"writes status"| DB
    CM2 -->|"reads status"| DB
```

The detached architecture means:
- Extraction is not coupled to CM server uptime
- Multiple extractions can run in parallel (one per territory/subject/year tuple)
- Server reload (e.g., during deploy) does not abort in-progress extractions

## spawnExtraction()

The core function that initiates an extraction run:

```typescript
spawnExtraction(
  territory: string,    // "Turkey", "England", "Norway"
  subject: string,      // "English", "Mathematics", "Science"
  yearLevel: number,    // 1–13
  extractionId: string  // UUID — used for status tracking
): void
```

This function:
1. Spawns a `child_process` with `detached: true` and `unref()` — process persists after parent exits
2. The worker process receives the extraction parameters via IPC or environment variables
3. The worker reads curriculum source documents from the CM's raw document store
4. For each document chunk: calls GPT-4o with a structured extraction prompt
5. GPT-4o returns JSON matching the `curriculum_objective` schema
6. Worker validates and INSERTs into `curriculum_objectives` table
7. Worker updates extraction status record on completion or failure

## Status Tracking

Each extraction run writes status to the DB so the CM dashboard can poll progress without re-querying the worker process:

| Status Field | Values | Notes |
|---|---|---|
| `state` | `running`, `completed`, `failed` | Updated by worker |
| `objectives_extracted` | INT | Incrementing count |
| `objectives_total` | INT (estimated) | Set at spawn time |
| `error_log` | JSON | Failures per document chunk |
| `started_at` | DATETIME | Worker spawn time |
| `completed_at` | DATETIME | Set on completion |

The CM dashboard polls `GET /api/v1/extractions/:extractionId` to display a progress bar. Because the worker is detached, status is always read from DB — never via process signal.

## Turkey Example: Scale in Production

The Turkey extraction is the largest completed run:

- **5,366 unique objective documents** extracted
- Covers all subjects and year levels in the Turkish national curriculum
- Multiple parallel extraction workers ran for different subject/year combinations
- Total time: multiple hours (exact time not recorded)

After extraction: Turkey has 3,692 records in the current `curriculum_objectives` table (some documents produced multiple objectives; others were filtered as duplicates or non-extractable).

England extraction was smaller (85 objectives) and took significantly less time.

## Failure Handling

Extraction workers use a per-chunk retry pattern:

1. Each curriculum document chunk is processed independently
2. If GPT-4o call fails: retry up to 2× with exponential backoff (2s, 8s)
3. If chunk still fails after retries: log to `error_log` in the extraction record, skip chunk, continue
4. If the worker process crashes entirely: the extraction record remains in `running` state indefinitely — **there is no auto-restart**. Admin must manually re-trigger.

⏳ **DECISION-XX:** Auto-restart of crashed extraction workers is not implemented in v1. Add monitoring / dead-worker detection in Phase 3.

## Triggering a New Extraction

Extractions are triggered from the CM admin dashboard. There is no public API for triggering extractions — it is an admin-only operation.

**Prerequisite:** Source curriculum documents must be available in the CM's document store. Currently documents are loaded manually by the team.

**Idempotency:** Re-running an extraction for an already-extracted `(territory, subject, yearLevel)` will produce duplicate records unless the existing objectives are cleared first. The admin UI warns about this but does not auto-deduplicate.

## Related Pages

- [[concepts/curriculum-mapper/index|Curriculum Mapper]] — overview
- [[concepts/curriculum-mapper/Lesson Book Pipeline]] — what happens after objectives are extracted
- [[entities/Curriculum Mapper]] — service entity page
